Higher education is a critical part of education all over the world. Over the past year, the
world has turned increasingly to online education due to the outbreak of the Covid-19 pandemic;
therefore, improving this education system has become an urgent matter. Online learning
systems are a primary environment for acquiring educational data, which can come from
different sources, especially academic institutions. These data can be analyzed to extract
usable information that helps in understanding university students’ performance and
identifying the factors that affect it. To extract meaningful information from these large
volumes of data, academic organizations must mine the data with high accuracy. In this work,
three different real datasets were selected, pre-processed, cleaned, and filtered before
applying a support vector machine (SVM) with a multilayer perceptron kernel (MLP kernel),
whose parameters were optimized using the simulated annealing (SA) algorithm to improve the
objective function value. While examining the search space, SA has the advantage of escaping
from local minima, since it allows a worse neighbor to be accepted as a solution in a
controlled manner. The results show that the designed system can determine the best SVM
parameters using SA and therefore yields better model evaluation results.
1. INTRODUCTION

COVID-19 was declared a pandemic by the World Health Organization (WHO) in the past year.
This pandemic disrupted education across the globe, as nationwide closures forced institutions
to shut down temporarily. It is estimated that the closures affected about 70% of the total
student population worldwide. Data mining algorithms are widely used for discovering hidden
patterns in data to help decision-makers, and they have become an efficient tool for uncovering
information from big data. Like business organizations [1], universities today operate in a
highly dynamic and strongly competitive environment [2], and education is no longer limited to
classroom teaching; it extends to other forms such as online education systems, web-based
education, seminars, project-based learning, and workshops. Data mining is very important in
educational systems, as shown in Figure 1, but none of these systems can succeed without
accurate evaluation. A successful education system therefore requires a well-defined and
accurate evaluation system, and predicting students’ performance with high accuracy is very
helpful for identifying students with low performance levels from the beginning of the learning
process. Modern universities collect large volumes of student data, which are used for
improving the educational process.
Figure 1. Data mining cycle for educational systems [1]


Data mining and machine learning use the same methods, but with a difference in emphasis:
machine learning focuses on prediction based on known properties, whereas data mining focuses
on identifying unknown properties. Support vector machine (SVM) is a machine learning technique
that builds a linear binary classifier; it defines the decision boundary between two classes
[2]. Optimization is the process of finding the best solution to a problem. There are many
optimization algorithms, such as the standard SPSA algorithm [3], [4], which is used for
optimizing systems with multiple unknown parameters; gradient descent, which is used for
finding a local minimum of a differentiable function [5]; and simulated annealing (SA), which
is used for approximating the global optimum of a given function in a large search space [6].
SA was therefore chosen for this work.

SA is a popular optimization algorithm inspired by the annealing of molten metals (slow
cooling after heating) to crystallize their structures [7]. It was introduced in 1983 by
Kirkpatrick et al., who, along with other researchers, proved analytically that SA can escape
from local optima and converge to the global optimum. Researchers have studied student data
over the past decade to predict student performance using data mining approaches and
correlation analysis, and each of these approaches achieved a different level of success.
V. K. Pal and Vimal Kamlesh Kumar Bhatt [8] studied the first dataset by applying an artificial
neural network (deep learning) after splitting the data into two subsets, a training set
containing 70% of the original data and a test set containing the remaining 30%. The resulting
accuracy on the test set was 97.749%, with a corresponding error rate of 2.251%. Y. K. Salal,
S. M. Abdullaev and Mukesh Kumar [9] also built classification models for the same dataset,
implementing algorithms such as NaiveBayes with accuracy 73.1895%, decision tree (J48) with
accuracy 76.57%, RandomTree with accuracy 67.95%, REPTree with accuracy 76.73%, JRip with
accuracy 74.11%, OneR with accuracy 76.73%, SimpleLogistic with accuracy 73.65% and ZeroR with
accuracy 30.97%. After implementing these algorithms on the student performance dataset, they
compared the results to find the best model for the prediction process.

Another study proposed a classification method based on a meta-heuristic PSO algorithm to
predict students’ final outcomes according to their activities, and the results improved by 89%
[10]. D. Kabakchieva [11] applied four different classifiers: OneR Rule Learner, Neural Network,
Decision Tree, and K-Nearest Neighbour. The neural network achieved the highest classification
accuracy, 73.9%, followed by 72.74% for the Decision Tree and 70.49% for the k-NN model.
S. Hussain, Neama Abdulaziz Dahan, Fadl Mutaher Ba-Alwi and Najoua Ribata [12] used
classification algorithms in WEKA and applied feature selection to select 12 of 33 attributes
for predicting student performance.

In this article, the problem to be optimized is the choice of SVM parameters. The SA
optimization technique helps improve the value of the objective function (classification
accuracy) by avoiding local minima. We also present a comparative study of academic students’
performance prediction, in which each algorithm is compared based on its accuracy to identify
the most appropriate model for this task. Comparing our results in section 8 with previously
published work clearly shows that the proposed SVM-SA gives better results for accuracy,
precision, sensitivity, and F-measures, which improves student academic performance prediction
for decision-makers. This paper is organized as follows: sections 2 and 3 explain the SVM
classifier and the SA algorithm, section 4 describes the proposed SVM-SA model, sections 5 and
6 describe the data preprocessing and the datasets used, section 7 presents the evaluation
measures, and section 8 reports the results, followed by the conclusion.


2. SUPPORT VECTOR MACHINE 

SVM was originally developed in 1995 using the structural risk minimization principle and
Vapnik-Chervonenkis theory. It is a supervised machine learning technique that is used for both
classification and regression problems. SVMs are more commonly used in classification problems
because of their high performance and generalization capability.

SVMs are based mainly on the idea of finding the best hyperplane, the one that maximizes the
margin (distance to the nearest points) between the nearest positive and negative data points
[13]. For linearly separable data this hyperplane is the class boundary, and a larger margin
gives a greater chance of new data being classified correctly [14]. Assume the training data is
the set $\{(x_i, y_i)\}$, $i = 1, 2, \ldots, m$, where $x_i \in \mathbb{R}^n$ represents the
i-th candidate vector and the target label $y_i \in \{-1, +1\}$ represents the output label
corresponding to the class of item $x_i$. The original formulation of the SVM algorithm seeks a
linear decision surface of the form $f(x) = w \cdot x + b$, where $w$ is an n-dimensional
coefficient vector and $b$ is the offset [15]. The linear SVM obtains the optimal hyperplane by
solving the following optimization problem:


$$\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^{2} \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \; \forall i \in \{1, \ldots, m\} \tag{1}$$

This quadratic optimization problem can be solved by finding the saddle point of the Lagrangian function: 
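$$L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^{2} - \sum_{i=1}^{m} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] \quad \text{s.t.} \quad \alpha_i \ge 0, \; i = 1, \ldots, m \tag{2}$$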

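where the $\alpha_i$ are the Lagrange variables. The KKT conditions are obtained by setting the
gradient of the Lagrangian with respect to the primal variables $w$ and $b$ to zero and by
writing the complementary conditions [16]:

$$\nabla_w L = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i \tag{3}$$

$$\nabla_b L = -\sum_{i=1}^{m} \alpha_i y_i = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0 \tag{4}$$

$$\forall i, \; \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0 \;\Rightarrow\; \alpha_i = 0 \;\lor\; y_i (w \cdot x_i + b) - 1 = 0 \tag{5}$$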


By (3), the weight vector $w$ that solves the SVM problem is a linear combination of the
training set vectors $x_1, \ldots, x_m$. According to the complementary conditions (5), $w$
depends only on the vectors $x_i$ for which $\alpha_i \neq 0$. These are called support
vectors, and they fully define the maximum-margin hyperplane. After substituting (3) and (4)
into (2), the dual form of the Lagrangian, $L_D(\alpha)$, is derived as follows:

$$L_D(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad \alpha_i \ge 0, \; i = 1, \ldots, m, \;\text{and}\; \sum_{i=1}^{m} \alpha_i y_i = 0 \tag{6}$$
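The substitution that leads to the dual form can be written out step by step. The following is a short worked derivation that uses only the Lagrangian, the stationarity conditions for $w$ and $b$, and the symbols already defined in this section; it is meant as a reading aid rather than as part of the original derivation.

```latex
\begin{align*}
L(w,b,\alpha)
  &= \tfrac{1}{2}\lVert w\rVert^{2}
     - \sum_{i=1}^{m}\alpha_i y_i\,(w\cdot x_i)
     - b\sum_{i=1}^{m}\alpha_i y_i
     + \sum_{i=1}^{m}\alpha_i
     && \text{expand the bracket} \\
  &= \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j\,(x_i\cdot x_j)
     - \sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j\,(x_i\cdot x_j)
     - 0
     + \sum_{i=1}^{m}\alpha_i
     && \text{use } w=\textstyle\sum_j \alpha_j y_j x_j,\ \sum_i \alpha_i y_i = 0 \\
  &= \sum_{i=1}^{m}\alpha_i
     - \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j\,(x_i\cdot x_j)
     = L_D(\alpha)
\end{align*}
```

Because the training vectors enter only through the inner products $x_i \cdot x_j$, a kernel function can replace that inner product without changing the rest of the problem.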

Equations (7), (8) and (9) present the polynomial kernel, the sigmoid kernel, and the radial
basis function kernel, respectively. These functions are used to find the optimal hyperplane.
In this proposal we used the sigmoid kernel (MLP), which corresponds to a feedforward ANN with
three layers of neurons in which each neuron except those in the input layer uses a nonlinear
activation function, and which applies backpropagation during network training [17]. The
weight, the bias and $\delta$ are the setting parameters of the multilayer perceptron (MLP) [18].


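Polynomial kernel:

$$k(x_i, x_j) = (1 + x_i \cdot x_j)^{d} \tag{7}$$

Sigmoid kernel:

$$k(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j + \delta) \tag{8}$$

where $\delta$ is the intercept constant.

Radial basis function (RBF) kernel:

$$k(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right) \tag{9}$$

where $\gamma$ is the kernel parameter.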
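As an illustration, the three kernels can be sketched in a few lines of standard-library Python. The function names, the default parameter values, and the choice to add the intercept inside the `tanh` (matching the MLP kernel form used later in this paper) are assumptions of this sketch, not the implementation used in the experiments.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2):
    """Polynomial kernel: (1 + u.v) raised to an assumed degree."""
    return (1.0 + dot(u, v)) ** degree


def sigmoid_kernel(u, v, slope=1.0, intercept=-1.0):
    """Sigmoid (MLP) kernel: tanh(slope * u.v + intercept)."""
    return math.tanh(slope * dot(u, v) + intercept)


def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

Each function returns a single similarity value for one pair of vectors; a full kernel matrix would be obtained by evaluating the chosen function for every pair of training vectors.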
There are two problems in the SVM classifier’s optimization procedure [13]: 1) how to select
relevant features and filter out irrelevant ones when constructing the SVM classifier; and
2) how to properly adjust the penalty parameter C and the hyperplane parameters [19]. SVM
parameters such as the kernel parameters and the penalty parameter have a great influence on
the accuracy and complexity of the classification models. Numerous evolutionary optimization
algorithms have been proposed for optimizing SVMs. In this paper, SA is proposed as the
optimization algorithm; it follows a search strategy that improves the value of the objective
function to find the parameter settings that best enhance the performance of the SVM classifier.


3. SIMULATED ANNEALING ALGORITHM 

The SA algorithm is a local search method designed to avoid local minima [7], [20]. SA’s major
advantage over older optimization methods is its ability to escape local minima. The method is
based on selecting a move at random in each stage instead of selecting the best move (best
neighbor) among the available moves. If the new state improves (reduces) the cost, it is
accepted as the next state; if it increases the cost, it is accepted only with probability P.
P is called the Metropolis probability and is defined as:


$$P(\Delta E) = e^{-\Delta E / T} \tag{10}$$


where $\Delta E$ represents the change in energy (the value of the cost function) caused by the
change in state, and $T$ is the temperature, or temperature-like variable, that controls this
probability. A “generative function” specifies how the variables are updated in each attempt,
and it is this function that determines the speed of convergence. In typical SA, the generative
function is a Gaussian or Boltzmann function:


$$g(X) = (2\pi T)^{-D/2} \, e^{-\Delta X^{2} / (2T)} \tag{11}$$


where $D$ is the dimension of the search space (the number of variables in the cost function)
and $\Delta X$ is the change in the variable vector $X$, so that $X = X_0 + \Delta X$, where
$X_0$ is the current state and $X$ is the next state of the variables. The temperature in the
k-th stage of the algorithm can be found using (12). The steps of the SA algorithm [21] are
shown in Figure 2.


$$T_k = \frac{T_0}{\ln k} \tag{12}$$
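The acceptance rule and the Gaussian generative function can be illustrated with a short standard-library sketch. The function names are hypothetical, and treating a zero-cost change as always accepted is an assumption of this sketch; the Gaussian step uses a variance equal to the temperature in each dimension, which is how the Gaussian generative function above is read here.

```python
import math
import random


def metropolis_probability(delta_e, temperature):
    """Probability of accepting a move that changes the cost by delta_e."""
    if delta_e <= 0:
        return 1.0  # assumption: improving or equal-cost moves are always accepted
    return math.exp(-delta_e / temperature)


def gaussian_step(x0, temperature, rng):
    """Next state X = X0 + dX, with dX drawn from N(0, T) in each dimension."""
    sigma = math.sqrt(temperature)
    return [xi + rng.gauss(0.0, sigma) for xi in x0]


# The same cost increase is much more likely to be accepted at high temperature.
hot = metropolis_probability(5.0, 100.0)
cold = metropolis_probability(5.0, 1.0)
```

Comparing `hot` and `cold` shows why the temperature controls how freely the search moves uphill: as the temperature falls, worsening moves are accepted less and less often.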


1. Choose an initial solution i from the set of feasible solutions S.

2. Choose the initial temperature $T_0 > 0$.

3. Select the number of iterations N(t) at each temperature.

4. Select the final temperature $T_f$.

5. Determine the temperature reduction process to be applied until $T_f$ is reached.

6. Set the iteration counter n to zero for the current temperature.

7. Create a solution j in the neighbourhood of solution i.

8. Evaluate the objective function and calculate $\Delta = z(j) - z(i)$.

9. Accept solution j if $\Delta < 0$. Otherwise, generate a random number in [0, 1) and accept solution j only if that number is less than the Metropolis probability (10).

10. Set n = n + 1. If n is equal to N(t), go to step 11. Otherwise, go to step 7.

11. Reduce the temperature. If it reaches $T_f$, stop. Otherwise, go to step 6.


Figure 2. SA algorithm 
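The steps of Figure 2 can be sketched as a generic minimization loop. This is a minimal sketch under stated assumptions: the geometric cooling factor, the default parameter values, and the toy quadratic cost are illustrative choices and are not taken from the experiments. Since SA as written minimizes a cost, a classification accuracy would be passed as `1 - accuracy`.

```python
import math
import random


def simulated_annealing(cost, neighbour, initial, t0=100.0, t_final=1e-3,
                        iters_per_temp=20, cooling=0.9, seed=0):
    """Minimise `cost`; geometric cooling is an assumption of this sketch."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)       # initial solution
    best, best_cost = current, current_cost
    t = t0                                                # initial temperature
    while t > t_final:                                    # stop at final temperature
        for _ in range(iters_per_temp):                   # N(t) moves per temperature
            candidate = neighbour(current, t, rng)        # neighbouring solution
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Accept improvements; accept worse moves with probability exp(-delta / t).
            if delta < 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling                                      # reduce the temperature
    return best, best_cost


# Hypothetical usage: minimise (x - 3)^2 starting from x = 10.
def quadratic(x):
    return (x - 3.0) ** 2


def gaussian_neighbour(x, t, rng):
    return x + rng.gauss(0.0, math.sqrt(t))


best_x, best_value = simulated_annealing(quadratic, gaussian_neighbour, 10.0)
```

Keeping the best solution seen so far is a common safeguard, because the current solution may move to a worse state late in the run.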


4. RESEARCH METHOD 

The penalty and kernel parameters of SVM have a great impact on the accuracy and complexity of
the classification model. This paper proposes a novel evolutionary approach for the SVM model:
it uses SVM with the MLP kernel and employs SA to optimize the kernel parameters, which are
expressed as P1 for the slope or weight and P2 for the intercept constant or bias, where P1>0
and P2<0. By default, these values are set to 1 and -1; tuning them can decrease the
classification error. In this section, we describe the proposed SA-SVM model, shown in
Figure 3, which finds the optimal values of the SVM parameters.
Figure 3. Flowchart of the proposed model (SA-SVM)


The main steps in the SA algorithm are: 1) generating a neighbor; 2) evaluating the objective
function (classification accuracy); 3) assigning an initial temperature; 4) changing the
temperature; 5) cooling schemes; and 6) stopping [21]. The initial solution is one of the
important components of SA; in this paper it is selected randomly from the feasible solution
space. The initial solution in our algorithm is represented by a two-element vector P as in
(13), where P1 is assigned to the weight and P2 is assigned to the bias.


$$P = (P_1, P_2), \quad \text{where } P_1 \in (0, 100], \; P_2 \in [-100, 0) \tag{13}$$

[P1 P2] is a vector that specifies the parameters of the MLP kernel. The MLP kernel takes the form:

$$k(U, V) = \tanh(P_1 \, U \cdot V + P_2) \tag{14}$$

where P1 > 0 and P2 < 0. The default is [1, -1].


A feasible solution is randomly selected as the initial solution. The objective function is an
important factor on which SA depends for evaluating individual solutions. We formulated the
objective function to depend mainly on the classification accuracy of the SVM represented by
the given solution: the cost of a solution is how accurately the training data is classified
when classification is conducted using the parameters given by that solution. The cost Z(P)
for a solution P is calculated over the training dataset (of size N) using (15).


$$Z(P) = \frac{\text{number of correctly classified instances}}{N} \tag{15}$$
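The random initial solution and the accuracy-based objective can be sketched as follows. The names are hypothetical, and `fit_predict` stands in for whatever routine trains an SVM with the MLP kernel for a given (P1, P2) and returns predicted training labels; it is a placeholder of this sketch, not a real library call.

```python
import random


def random_initial_solution(rng):
    """Draw P = (P1, P2) with 0 < P1 <= 100 and -100 <= P2 < 0."""
    p1 = 100.0 - rng.random() * 100.0      # rng.random() is in [0, 1)
    p2 = -(100.0 - rng.random() * 100.0)
    return p1, p2


def accuracy(predicted, actual):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def objective(params, train_x, train_y, fit_predict):
    """Z(P): training accuracy of the classifier built from `params`.

    `fit_predict(params, train_x, train_y)` is a hypothetical callable that
    trains the classifier and returns its predictions on the training data.
    """
    predicted = fit_predict(params, train_x, train_y)
    return accuracy(predicted, train_y)
```

Because Z(P) is an accuracy to be maximized, a minimizing SA loop would use `1 - objective(...)` as its cost.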


The initial temperature is also an important parameter, because it has a large effect on the
chance of accepting a bad solution. If the initial temperature is high, a solution with a bad
objective function value has a high chance of being accepted, whereas a low initial
temperature increases the probability that the final solution is a local optimum. In this
work, the initial temperature is chosen in the range from 0 to 500.


5. DATA PREPROCESSING 

The input and output data were pre-processed by cleaning the missing values, converting the
nominal data to numerical data, converting the target to a binary class (0 means fail, 1 means
success), and splitting each dataset into two parts, training and testing datasets (ratio of
70%:30%), without any feature selection for any dataset. We did not use cross-validation, to
keep the comparison fair, because the papers used in the comparison used the same ratio of
training and testing data.
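The preprocessing steps can be sketched with the standard library alone. The helper names, the integer encoding of nominal values, the `pass_mark` threshold, and the shuffled split are all assumptions made for illustration; the text does not state how pass and fail were thresholded or whether the data were shuffled.

```python
import random


def encode_nominal(values):
    """Map each distinct nominal value to an integer code (sorted order)."""
    mapping = {category: code for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def to_binary_label(grade, pass_mark):
    """1 means success, 0 means fail; `pass_mark` is a hypothetical threshold."""
    return 1 if grade >= pass_mark else 0


def train_test_split(rows, train_ratio=0.7, seed=0):
    """Shuffle a copy of `rows` and split it into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


codes, mapping = encode_nominal(["GP", "MS", "GP"])
train_rows, test_rows = train_test_split(list(range(10)))
```

Fixing the seed keeps the split reproducible, which matters when the same partition is reused to compare several classifiers.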


6. DATA DESCRIPTION 

The first dataset consists of 649 instances with 33 attributes. This student performance
dataset was collected from two Portuguese secondary schools, Gabriel Pereira (GP) and Mousinho
da Silveira (MS). The dataset contains student attributes such as academic grades, social
attributes, demographic attributes, and school-related attributes. The data were collected from
the students using school reports and questionnaires. Details of the datasets are shown in
Table 1 [9], [22]. The second dataset is from three different colleges, Duliajan, Doomdooma,
and Digboi College of Assam, India. Initially, data for twenty-two attributes were collected
[12]. The third dataset is from the Common Entrance Examination conducted by Dibrugarh
University. The collected data, with 12 attributes, were of students who came for counseling
cum admission into medical colleges of Assam in the year 2013 [23]. The three datasets are
imbalanced, as shown in Table 2; this is due to the low repetition rate among students at the
places where the data were compiled.


Table 1. Information of datasets

| Dataset | Instances | No. of attributes |
|---|---|---|
| Portuguese course | 649 | 33 |
| Sapfile | 300 | 22 |
| CEE-data | 666 | 12 |





Table 2. Number of instances per class

| Dataset | Pass | Not pass |
|---|---|---|
| Portuguese course | 452 | 197 |
| Sapfile | 224 | 76 |
| CEE-data | 509 | 157 |





7. MODEL EVALUATION 

Classification accuracy, the number of instances predicted correctly out of all predictions
made, is always the first measure considered when a model is built for a classification
problem. However, classification accuracy alone is not sufficient to evaluate a model,
especially in the case of imbalanced data classification. Therefore, we also considered other
measurements such as sensitivity, precision, and F-measures [24]. The measurement equations
used in model evaluation are listed in Table 3, where TP denotes true positives, TN true
negatives, FP false positives, and FN false negatives, and P and N denote the total numbers of
positive and negative instances [25].


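Table 3. Measurement equations

| Measure | Formula |
|---|---|
| Accuracy, recognition rate | $\frac{TP + TN}{P + N}$ |
| Sensitivity, true positive rate, recall | $\frac{TP}{P}$ |
| Precision | $\frac{TP}{TP + FP}$ |
| F-measures | $\frac{2 \times \text{Precision} \times \text{recall}}{\text{Precision} + \text{recall}}$ |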
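The four measures can be computed directly from the confusion-matrix counts. The sketch below uses hypothetical function names and returns 0.0 where a denominator would be zero, which is a convention chosen here rather than one stated in the text.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total else 0.0


def evaluation_measures(tp, tn, fp, fn):
    """Accuracy, sensitivity (recall), precision and F-measure from counts."""
    positives = tp + fn            # P: all actual positive instances
    negatives = tn + fp            # N: all actual negative instances
    accuracy = (tp + tn) / (positives + negatives)
    recall = tp / positives if positives else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": recall,
        "precision": precision,
        "f_measure": f_measure(precision, recall),
    }


# Illustrative counts only; these are not taken from the experiments.
example = evaluation_measures(tp=40, tn=30, fp=10, fn=20)
```

As a consistency check, `f_measure(0.632, 0.7528)` gives about 0.687, which agrees with the F-measures value reported for SVM with SA on the CEE dataset when computed from its reported precision and sensitivity.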





8. RESULTS 
The SA-SVM algorithm was developed on a laptop with the following features: Intel(R) Core(TM)
i7-4600 CPU @ 2.10 GHz, 8 GB RAM, and Windows 10 Pro as the operating system, using MATLAB
version R2015a. To evaluate the proposed algorithm, three standard classification datasets were
used. The datasets were obtained from the University of California at Irvine (UCI) Machine
Learning Repository, and their properties are listed in Table 1. Each dataset was split into
training and testing datasets with a ratio of 70% and 30%, respectively, and the results are
shown in Table 4, Table 5 and Table 6.


Table 4. CEE dataset results

| CEE dataset | Accuracy | Sensitivity | Precision | F-measures |
|---|---|---|---|---|
| SVM (MLP kernel) | 61.8% | 52.8% | 58% | 55.29% |
| SVM (MLP kernel) & SA | 69.34% | 75.28% | 63.2% | 68.71% |





Table 5. Portuguese course dataset results

| Portuguese course dataset | Accuracy | Sensitivity | Precision | F-measures |
|---|---|---|---|---|
| SVM (MLP kernel) | 78.35% | 76.21% | 97.65% | 85.6% |
| SVM (MLP kernel) & SA | 90.72% | 97.56% | 91.95% | 94.67% |





Table 6. Sapfile dataset results

| Sapfile dataset | Accuracy | Sensitivity | Precision | F-measures |
|---|---|---|---|---|
| SVM (MLP kernel) | 61.1% | 55.22% | 88% | 67.88% |
| SVM (MLP kernel) & SA | 67.77% | 88% | 73.15% | 80.27% |





From Tables 4 to 9, it is clear that our proposed method, without feature selection, performs
competitively against all the other classification algorithms in terms of prediction accuracy.
The Portuguese course dataset showed 78.35% accuracy for SVM with the MLP kernel; this accuracy
increased to 90.72% after applying the proposed method, as did the sensitivity and F-measures,
as shown in Table 5, and it is better than the accuracy of the other presented classifiers, as
shown in Table 7. The CEE dataset showed 61.8% accuracy for SVM with the MLP kernel; this
accuracy increased to 69.34% after applying the proposed method, as did the sensitivity and
F-measures, as shown in Table 4, and it is better than the accuracy of NaiveBayes, decision
tree (J48), ZeroR, REPTree, OneR, RandomTree, JRip, and SimpleLogistic, as shown in Table 8.

The Sapfile dataset showed 61.1% accuracy for SVM with the MLP kernel; this accuracy increased
to 67.77% after applying the proposed method, as did the sensitivity and F-measures, as shown
in Table 6, and it is better than the BayesNet accuracy, as shown in Table 9. Also, the large
improvement in the F-measures for the three datasets, which reached 13%, strongly supports the
efficiency of the proposed SVM-SA model in dealing with the problem of class imbalance in the
data. All of the foregoing confirms the effectiveness of the proposed method.


Table 7. Portuguese course dataset comparison

| Method | Accuracy | Ref. | Wilcoxon rank |
|---|---|---|---|
| SVM (MLP kernel) & SA | 90.72% | Proposed | 1 |
| NaiveBayes | 68.25% | [9] | 6 |
| Decision Tree | 67.79% | [9] | 7 |
| RandomTree | 53.46% | [9] | 8 |
| REPTree | 75.19% | [9] | 3 |
| JRip | 70.72% | [9] | 5 |
| OneR | 76.73% | [9] | 2 |
| SimpleLogistic | 71.34% | [9] | 4 |
| ZeroR | 30.97% | [9] | 9 |





Table 8. CEE dataset comparison

| Method | Accuracy | Ref. | Wilcoxon rank |
|---|---|---|---|
| SVM (MLP kernel) & SA | 69.34% | Proposed | 1 |
| Decision Tree (J48) | 64.71% | [23] | 2 |
| Naïve Bayes | 57.81% | [23] | 4 |
| ZeroR | 31.53% | Weka | 9 |
| REPTree | 55.55% | Weka | 5 |
| SimpleLogistic | 60.36% | Weka | 3 |
| OneR | 51.05% | Weka | 7 |
| JRip | 54.80% | Weka | 6 |
| RandomTree | 33.5% | Weka | 8 |
Table 9. Sapfile dataset comparison

| Method | Accuracy | Ref. | Wilcoxon rank |
|---|---|---|---|
| SVM (MLP kernel) & SA | 67.77% | Proposed | 1 |
| BayesNet | 65.33% | [12] | 2 |
| Naïve Bayes | 51.11% | Weka | 5 |
| ZeroR | 41.66% | Weka | 8 |
| REPTree | 55% | Weka | 4 |
| SimpleLogistic | 63.33% | Weka | 3 |
| OneR | 50.66% | Weka | 6 |
| JRip | 47.77% | Weka | 7 |
| RandomTree | 37.61% | Weka | 9 |

9. CONCLUSION

Machine learning techniques applied to educational data can be used to improve the learning
process of students in higher education institutes. Researchers have developed different
methods to predict students’ performance in their enrolled courses and to provide valuable
information that helps facilitate student retention in those courses. Instructors can use this
information for early identification of students who might need assistance in their studies. In
our work, SVM was applied to three different real datasets; then a hybridization of SVM with
the MLP kernel and SA was used to enhance the results, which were finally compared with the
results of other algorithms. The results showed that the proposed method improved after
applying the SA optimization technique and presents higher performance than the other methods.